1.
Nat Genet ; 26(2): 225-8, 2000 Oct.
Article in English | MEDLINE | ID: mdl-11017083

ABSTRACT

Elucidating the human transcriptional regulatory network is a challenge of the post-genomic era. Technical progress so far is impressive, including detailed understanding of regulatory mechanisms for at least a few genes in multicellular organisms, rapid and precise localization of regulatory regions within extensive regions of DNA by means of cross-species comparison, and de novo determination of transcription-factor binding specificities from large-scale yeast expression data. Here we address two problems involved in extending these results to the human genome: first, it has been unclear how many model organism genomes will be needed to delineate most regulatory regions; and second, the discovery of transcription-factor binding sites (response elements) from expression data has not yet been generalized from single-celled organisms to multicellular organisms. We found that 98% (74/75) of experimentally defined sequence-specific binding sites of skeletal-muscle-specific transcription factors are confined to the 19% of human sequences that are most conserved in the orthologous rodent sequences. Also we found that in using this restriction, the binding specificities of all three major muscle-specific transcription factors (MYF, SRF and MEF2) can be computationally identified.


Subjects
Genome, Human; Mice/genetics; Regulatory Sequences, Nucleic Acid; Algorithms; Animals; Base Sequence; Consensus Sequence; Gene Expression Regulation; Humans; Models, Genetic; Sequence Alignment; Transcription, Genetic
2.
Genome Res ; 10(10): 1631-42, 2000 Oct.
Article in English | MEDLINE | ID: mdl-11042160

ABSTRACT

One of the first useful products from the human genome will be a set of predicted genes. Besides its intrinsic scientific interest, the accuracy and completeness of this data set is of considerable importance for human health and medicine. Though progress has been made on computational gene identification in terms of both methods and accuracy evaluation measures, most of the sequence sets in which the programs are tested are short genomic sequences, and there is concern that these accuracy measures may not extrapolate well to larger, more challenging data sets. Given the absence of experimentally verified large genomic data sets, we constructed a semiartificial test set comprising a number of short single-gene genomic sequences with randomly generated intergenic regions. This test set, which should still present an easier problem than real human genomic sequence, mimics the approximately 200kb long BACs being sequenced. In our experiments with these longer genomic sequences, the accuracy of GENSCAN, one of the most accurate ab initio gene prediction programs, dropped significantly, although its sensitivity remained high. Conversely, the accuracy of similarity-based programs, such as GENEWISE, PROCRUSTES, and BLASTX was not affected significantly by the presence of random intergenic sequence, but depended on the strength of the similarity to the protein homolog. As expected, the accuracy dropped if the models were built using more distant homologs, and we were able to quantitatively estimate this decline. However, the specificities of these techniques are still rather good even when the similarity is weak, which is a desirable characteristic for driving expensive follow-up experiments. 
Our experiments suggest that though gene prediction will improve with every new protein that is discovered and through improvements in the current set of tools, we still have a long way to go before we can decipher the precise exonic structure of every gene in the human genome using purely computational methodology.


Subjects
Computational Biology/methods; DNA/chemistry; DNA/genetics; Genes/genetics; Base Composition; Chromosomes, Artificial/chemistry; Chromosomes, Artificial/genetics; Humans; Reproducibility of Results; Software
3.
Curr Opin Biotechnol ; 11(1): 19-24, 2000 Feb.
Article in English | MEDLINE | ID: mdl-10679343

ABSTRACT

A complex network of regulatory controls governs the patterns of gene expression. Enabled by the tools of molecular cloning, initial experimental queries into the gene regulatory network elucidated a wide array of transcription factors and their cognate binding sites from hundreds of genes. The recent fusion of genome-scale experimental tools, a more comprehensive gene catalog, and concomitant advances in computational methodology, has extended the range of questions being posed. The potential to further our understanding of the biochemical mechanisms of transcriptional regulation and to accelerate the delineation of regulatory control regions in the human genome is enormous.


Subjects
Computational Biology; Regulatory Sequences, Nucleic Acid/genetics; Transcription Factors/metabolism; Transcription, Genetic/genetics; Animals; Base Sequence; Binding Sites; DNA Footprinting; DNA-Binding Proteins/metabolism; Humans; Phylogeny; Promoter Regions, Genetic/genetics
4.
Genome Res ; 9(12): 1288-93, 1999 Dec.
Article in English | MEDLINE | ID: mdl-10613851

ABSTRACT

Alternative splicing can produce variant proteins and expression patterns as different as the products of different genes, yet the prevalence of alternative splicing has not been quantified. Here the spliced alignment algorithm was used to make a first inventory of exon-intron structures of known human genes using EST contigs from the TIGR Human Gene Index. The results on any one gene may be incomplete and will require verification, yet the overall trends are significant. Evidence of alternative splicing was shown in 35% of genes and the majority of splicing events occurred in 5' untranslated regions, suggesting wide occurrence of alternative regulation. Most of the alternative splices of coding regions generated additional protein domains rather than alternating domains.


Subjects
Alternative Splicing; 5' Untranslated Regions/genetics; Base Sequence/genetics; Contig Mapping/methods; Databases, Factual; Exons/genetics; Expressed Sequence Tags; Humans; Introns/genetics; Molecular Sequence Data; Protein Isoforms/genetics; Sequence Alignment
5.
Nucleic Acids Res ; 27(17): 3577-82, 1999 Sep 01.
Article in English | MEDLINE | ID: mdl-10446249

ABSTRACT

With the growing number of completely sequenced bacterial genes, accurate gene prediction in bacterial genomes remains an important problem. Although the existing tools predict genes in bacterial genomes with high overall accuracy, their ability to pinpoint the translation start site remains unsatisfactory. In this paper, we present a novel approach to bacterial start site prediction that takes into account multiple features of a potential start site, viz., ribosome binding site (RBS) binding energy, distance of the RBS from the start codon, distance from the beginning of the maximal ORF to the start codon, the start codon itself and the coding/non-coding potential around the start site. Mixed integer programming was used to optimize the discriminatory system. The accuracy of this approach is up to 90%, compared to 70% using the most common tools in fully automated mode (that is, without expert human post-processing of results). The approach is evaluated using Bacillus subtilis, Escherichia coli and Pyrococcus furiosus. These three genomes cover a broad spectrum of bacterial genomes, since B. subtilis is a Gram-positive bacterium, E. coli is a Gram-negative bacterium and P. furiosus is an archaebacterium. A significant problem is generating a set of 'true' start sites for algorithm training, in the absence of experimental work. We found that sequence conservation between P. furiosus and the related Pyrococcus horikoshii clearly delimited the gene start in many cases, providing a sufficient training set.


Subjects
Codon, Initiator; Genome, Bacterial; Protein Biosynthesis; Algorithms; Amino Acid Sequence; Bacillus subtilis/genetics; Conserved Sequence; Escherichia coli/genetics; Molecular Sequence Data; Pyrococcus furiosus/genetics; Sequence Homology, Amino Acid
7.
J Mol Biol ; 278(1): 167-81, 1998 Apr 24.
Article in English | MEDLINE | ID: mdl-9571041

ABSTRACT

For many newly sequenced genes, sequence analysis of the putative protein yields no clue on function. It would be beneficial to be able to identify in the genome the regulatory regions that confer temporal and spatial expression patterns for the uncharacterized genes. Additionally, it would be advantageous to identify regulatory regions within genes of known expression pattern without performing the costly and time consuming laboratory studies now required. To achieve these goals, the wealth of case studies performed over the past 15 years will have to be collected into predictive models of expression. Extensive studies of genes expressed in skeletal muscle have identified specific transcription factors which bind to regulatory elements to control gene expression. However, potential binding sites for these factors occur with sufficient frequency that it is rare for a gene to be found without one. Analysis of experimentally determined muscle regulatory sequences indicates that muscle expression requires multiple elements in close proximity. A model is generated with predictive capability for identifying these muscle-specific regulatory modules. Phylogenetic footprinting, the identification of sequences conserved between distantly related species, complements the statistical predictions. Through the use of logistic regression analysis, the model promises to be easily modified to take advantage of the elucidation of additional factors, cooperation rules, and spacing constraints.


Subjects
Gene Expression Regulation; Muscle, Skeletal/metabolism; Regulatory Sequences, Nucleic Acid; Transcription Factors/metabolism; Binding Sites; DNA Footprinting; Genetic Complementation Test; Genome; Mathematical Computing; Models, Molecular; Phylogeny; Transcription Factors/genetics
10.
Trends Genet ; 12(8): 316-20, 1996 Aug.
Article in English | MEDLINE | ID: mdl-8783942

ABSTRACT

Discovering new genes, and their functions, can be aided not only by special purpose gene (and coding region) finding software, but also by searches in key databases, and by programs for finding particular sites relevant to gene expression, such as promoters and splice sites. No one software package includes all the necessary tools. I describe here the main kinds of tools; their working principles, strengths and limitations; and how combined evidence from multiple tools can aid in optimum gene identification.


Subjects
Computational Biology; Databases, Factual; Genes; Amino Acid Sequence; Animals; Base Sequence; Codon; DNA/chemistry; Exons; Humans; Molecular Sequence Data; Repetitive Sequences, Nucleic Acid; Software
11.
Gene ; 172(1): GC19-32, 1996 Jun 12.
Article in English | MEDLINE | ID: mdl-8654964

ABSTRACT

The MEF2 and MyoD families of transcriptional regulatory factors both play central roles in the terminal differentiation of skeletal muscle. Further, binding sites for the two families often occur nearby, and there have been a number of indications that members of the two families may bind coordinately. The present study provides evidence that known binding sites for the two occur with precise geometric restrictions related to the DNA helical repeat unit, that pairs of putative sites following these restrictions are indicative of skeletal muscle-specific transcriptional regulatory regions, and that the geometric relationship can help provide a consistent interpretation for data that has until now been difficult to explain.


Subjects
DNA-Binding Proteins/metabolism; Myogenin/metabolism; Transcription Factors/metabolism; Animals; Base Sequence; Binding Sites; Biological Evolution; Conserved Sequence; DNA-Binding Proteins/genetics; Enhancer Elements, Genetic; Humans; MEF2 Transcription Factors; Molecular Sequence Data; Myogenic Regulatory Factors; Myogenin/genetics; Oligodeoxyribonucleotides; Transcription Factors/genetics; Transcription, Genetic
12.
Comput Chem ; 20(1): 103-18, 1996 Mar.
Article in English | MEDLINE | ID: mdl-16749184

ABSTRACT

The gene identification problem is the problem of interpreting nucleotide sequences by computer, in order to provide tentative annotation on the location, structure, and functional class of protein-coding genes. This problem is of self-evident importance, and is far from being fully solved, particularly for higher eukaryotes. Thus it is not surprising that the number of algorithm and software developers working in the area is rapidly increasing. The present paper is an overview of the field, with an emphasis on eukaryotes, for such developers.


Subjects
Genes/genetics; Base Sequence/genetics; Codon/genetics; Exons/genetics; Gene Expression/genetics; Models, Genetic; Sequence Homology
13.
Mol Cell Biol ; 16(1): 437-41, 1996 Jan.
Article in English | MEDLINE | ID: mdl-8524326

ABSTRACT

Myocyte-specific enhancer factor 2 (MEF2) is a family of closely related transcription factors that play a key role in the differentiation of muscle tissues and are important in the muscle-specific expression of a number of genes. Given the centrality of MEF2 in muscle differentiation, regulatory regions newly determined to be muscle specific are often studied for potential MEF2 binding sites. Possible sites are often located by comparison to a homologous gene or by matching to the consensus MEF2 sequence. Enough data have accumulated that a richer description of the MEF2 binding site, a position weight matrix, can be reliably constructed and its usefulness can be assessed. It was shown that scores from such a matrix approximate MEF2 binding energy and enable recognition of naturally occurring MEF2 sites with high sensitivity and specificity. Regulation of genes via MEF2-like sites is complicated by the fact that a number of transcription factors are involved. Not only is MEF2 itself a family of proteins, but several other, nonhomologous, transcription factors overlap MEF2 in DNA-binding specificity. Thus, more quantitative methods for recognizing potential sites may help with the lengthy process of disentangling the complex regulatory circuits of muscle-specific expression.
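
The position weight matrix idea described in this abstract can be illustrated with a minimal sketch. It builds a log-odds matrix from a few aligned sites and scans a sequence for the best-scoring window. The example sites, the pseudocount, and the uniform background frequency are hypothetical placeholders chosen for illustration; they are not data or parameters from the paper.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds position weight matrix from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(length):
        # Start every base at the pseudocount so unseen bases get a finite score.
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background) for b in BASES})
    return pwm

def score_window(pwm, window):
    """Sum the per-position log-odds scores for one window of matrix width."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Return (score, offset) of the highest-scoring window in the sequence."""
    width = len(pwm)
    hits = [(score_window(pwm, sequence[i:i + width]), i)
            for i in range(len(sequence) - width + 1)]
    return max(hits)

# Hypothetical aligned sites, invented for this example only.
EXAMPLE_SITES = ["CTAAAAATAG", "CTATTTATAG", "CTAAAAATAA", "TTATTTATAG"]
```

Calling `best_hit(build_pwm(EXAMPLE_SITES), sequence)` returns the top score and its offset. In practice a score threshold would be calibrated against known sites, which is where the sensitivity and specificity mentioned in the abstract come in.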


Subjects
DNA-Binding Proteins/genetics; DNA-Binding Proteins/metabolism; Transcription Factors/genetics; Transcription Factors/metabolism; Amino Acid Sequence; Animals; Binding Sites/genetics; Biometry; DNA/metabolism; Humans; MEF2 Transcription Factors; Molecular Sequence Data; Muscles/metabolism; Mutagenesis, Site-Directed; Myogenic Regulatory Factors
14.
J Mol Biol ; 253(1): 51-60, 1995 Oct 13.
Article in English | MEDLINE | ID: mdl-7473716

ABSTRACT

We have studied the behavior of a number of sequence statistics, mostly indicative of protein coding function, in a large set of human clone sequences randomly selected in the course of genome mapping (randomly selected clone sequences), and compared this with the behavior in known sequences containing genes (which we term genic sequences). As expected, given the higher coding density of the genic sequences, the sequence statistics studied behave in a substantially different manner in the randomly selected clone sequences (mostly intergenic DNA) and in the genic sequences. Strong differences in behavior of a number of such statistics are also observed, however, when the randomly selected clone sequences are compared with only the non-coding fraction of the genic sequences, suggesting that intergenic and genic non-coding DNA constitute two different classes of non-coding DNA. By studying the behavior of the sequence statistics in simulated DNA of different C+G content, we have observed that a number of them are strongly dependent on C+G content. Thus, most differences between intergenic and genic non-coding DNA can be explained by differences in C+G content. A+T-rich intergenic DNA appears to be at the compositional equilibrium expected under random mutation, while C+G richer non-coding genic DNA is far from this equilibrium. The results obtained in simulated DNA indicate, on the other hand, that a very large fraction of the variation in the coding statistics that underlie gene identification algorithms is due simply to C+G content, and is not directly related to protein coding function. It appears, thus, that the performance of gene-finding algorithms should be improved by carefully distinguishing the effects of protein coding function from those of mere base compositional variation on such coding statistics.


Subjects
Base Sequence/genetics; DNA/genetics; Genes/genetics; Algorithms; Base Composition; Databases, Factual; Discriminant Analysis; Humans; Open Reading Frames/genetics; Proteins/genetics
15.
J Comput Biol ; 2(1): 117-23, 1995.
Article in English | MEDLINE | ID: mdl-7497114

ABSTRACT

The length of an open reading frame (ORF) is one important piece of evidence often used in locating new genes, particularly in organisms where splicing is rare. However, there have been no systematic studies quantifying the degree of correlation between length of ORF, on the one hand, and likelihood of gene function, on the other. In this paper, techniques are derived to estimate the conditional probability of gene function, given ORF length, based on evidence both from the databases and from simulation. Several complete chromosomes of Saccharomyces cerevisiae have now been sequenced, and considerable effort is being expended on locating and characterizing the genes in these sequences. Thus, we illustrate the techniques for this organism.


Subjects
Chromosomes, Fungal; Databases, Factual; Genes; Open Reading Frames; Saccharomyces cerevisiae/genetics; Amino Acid Sequence; Base Sequence; Fungal Proteins/chemistry; Fungal Proteins/genetics; Protein Biosynthesis; RNA Splicing
16.
Comput Chem ; 18(3): 203-5, 1994 Sep.
Article in English | MEDLINE | ID: mdl-7952890

ABSTRACT

One expects that in DNA without protein coding function, stop codons (which constitute three of the 64 possible codons) should occur frequently in all reading frames, and that a long open reading frame (ORF) can be interpreted as a sign for the existence of a gene. We make a beginning on introducing quantitative measures of confidence into this inference--taking Saccharomyces cerevisiae as a sample case--and show that some common assumptions can reasonably be questioned. In particular we show that statistical support for the biological function of shorter ORFs listed as putative genes in recent papers is in fact very weak. This is an issue of practical as well as theoretical interest, since researching the function of a putative gene is difficult and expensive.
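
The baseline expectation stated in this abstract can be made concrete with a short worked sketch. It assumes independent bases (a simplification that is not the paper's own statistical treatment): a random codon is then a stop with probability 3/64 under uniform composition, and a run of n codons stays open with probability (1 - p_stop)^n. The function names and the example composition are hypothetical.

```python
def stop_codon_probability(p_a=0.25, p_c=0.25, p_g=0.25, p_t=0.25):
    """Probability that a random codon is TAA, TAG or TGA, assuming independent bases."""
    if abs(p_a + p_c + p_g + p_t - 1.0) > 1e-9:
        raise ValueError("base frequencies must sum to 1")
    # TAA + TAG + TGA
    return p_t * p_a * p_a + p_t * p_a * p_g + p_t * p_g * p_a

def prob_open_run(n_codons, p_stop):
    """Probability that n consecutive random codons contain no stop codon."""
    return (1.0 - p_stop) ** n_codons

def expected_chance_orfs(n_codons, n_start_positions, p_stop):
    """Expected number of open runs of at least n codons arising by chance alone."""
    return n_start_positions * prob_open_run(n_codons, p_stop)
```

With uniform composition, `prob_open_run(100, stop_codon_probability())` is below 1%, but multiplied over the many candidate start positions in a chromosome the expected number of chance ORFs of that length is far from negligible. This is the kind of reasoning that makes short ORFs weak evidence for a gene. An A+T-rich composition raises the stop probability and shortens chance ORFs further.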


Subjects
Genes; Open Reading Frames; Base Composition; Chromosomes, Artificial, Yeast; DNA, Fungal/genetics; Genes, Fungal; Models, Genetic; Saccharomyces cerevisiae/genetics
17.
Nucleic Acids Res ; 21(12): 2837-44, 1993 Jun 25.
Article in English | MEDLINE | ID: mdl-8332493

ABSTRACT

A number of experimental methods have been reported for estimating the number of genes in a genome, or the closely related coding density of a genome, defined as the fraction of base pairs in codons. Recently, DNA sequence data representative of the genome as a whole have become available for several organisms, making the problem of estimating coding density amenable to sequence analytic methods. Estimates of coding density for a single genome vary widely, so that methods with characterized error bounds have become increasingly desirable. We present a method to estimate the protein coding density in a corpus of DNA sequence data, in which a 'coding statistic' is calculated for a large number of windows of the sequence under study, and the distribution of the statistic is decomposed into two normal distributions, assumed to be the distributions of the coding statistic in the coding and noncoding fractions of the sequence windows. The accuracy of the method is evaluated using known data and application is made to the yeast chromosome III sequence and to C. elegans cosmid sequences. It can also be applied to fragmentary data, for example a collection of short sequences determined in the course of STS mapping.
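
The decomposition step described in this abstract can be sketched as a two-component normal mixture fitted to per-window scores. The sketch below uses expectation-maximisation, which is an assumption on my part: the abstract does not say how the decomposition was carried out. The function names, the initialisation, and the simulated scores are all hypothetical. The fitted weight of the higher-mean component plays the role of the estimated coding fraction of windows.

```python
import math
import random

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def fit_two_normals(values, iterations=200):
    """EM fit of a two-normal mixture.

    Returns (weight_hi, (mean_lo, sd_lo), (mean_hi, sd_hi)).
    """
    if len(values) < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)
    half = len(ordered) // 2
    # Crude initialisation from the lower and upper halves of the sorted data.
    mean_lo = sum(ordered[:half]) / half
    mean_hi = sum(ordered[half:]) / (len(ordered) - half)
    sd_lo = sd_hi = (ordered[-1] - ordered[0]) / 4.0 or 1.0
    weight_hi = 0.5
    for _ in range(iterations):
        # E-step: responsibility of the higher-mean component for each value.
        resp = []
        for x in values:
            hi = weight_hi * normal_pdf(x, mean_hi, sd_hi)
            lo = (1.0 - weight_hi) * normal_pdf(x, mean_lo, sd_lo)
            resp.append(hi / (hi + lo) if hi + lo > 0.0 else 0.5)
        # M-step: re-estimate weights, means and standard deviations.
        n_hi = max(sum(resp), 1e-12)
        n_lo = max(len(values) - n_hi, 1e-12)
        mean_hi = sum(r * x for r, x in zip(resp, values)) / n_hi
        mean_lo = sum((1.0 - r) * x for r, x in zip(resp, values)) / n_lo
        var_hi = sum(r * (x - mean_hi) ** 2 for r, x in zip(resp, values)) / n_hi
        var_lo = sum((1.0 - r) * (x - mean_lo) ** 2 for r, x in zip(resp, values)) / n_lo
        sd_hi = max(math.sqrt(var_hi), 1e-6)
        sd_lo = max(math.sqrt(var_lo), 1e-6)
        weight_hi = n_hi / len(values)
    return weight_hi, (mean_lo, sd_lo), (mean_hi, sd_hi)

def simulated_scores(n_noncoding, n_coding, seed=0):
    """Hypothetical window scores: noncoding ~ N(0, 1), coding ~ N(5, 1)."""
    rng = random.Random(seed)
    return ([rng.gauss(0.0, 1.0) for _ in range(n_noncoding)] +
            [rng.gauss(5.0, 1.0) for _ in range(n_coding)])
```

On simulated data with a quarter of the windows drawn from the higher component, the fitted `weight_hi` should land near 0.25. Real coding statistics overlap far more than this toy example, which is why the abstract stresses characterised error bounds.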


Subjects
Base Composition; Codon; DNA/chemistry; Proteins/genetics; Animals; Caenorhabditis elegans/genetics; Cosmids; DNA/analysis; Genes, Fungal; Humans; Sequence Analysis, DNA; Statistics as Topic
18.
Nucleic Acids Res ; 20(24): 6441-50, 1992 Dec 25.
Article in English | MEDLINE | ID: mdl-1480466

ABSTRACT

A number of methods for recognizing protein coding genes in DNA sequence have been published over the last 13 years, and new, more comprehensive algorithms, drawing on the repertoire of existing techniques, continue to be developed. To optimize continued development, it is valuable to systematically review and evaluate published techniques. At the core of most gene recognition algorithms is one or more coding measures--functions which produce, given any sample window of sequence, a number or vector intended to measure the degree to which a sample sequence resembles a window of 'typical' exonic DNA. In this paper we review and synthesize the underlying coding measures from published algorithms. A standardized benchmark is described, and each of the measures is evaluated according to this benchmark. Our main conclusion is that a very simple and obvious measure--counting oligomers--is more effective than any of the more sophisticated measures. Different measures contain different information. However there is a great deal of redundancy in the current suite of measures. We show that in future development of gene recognition algorithms, attention can probably be limited to six of the twenty or so measures proposed to date.
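
A coding measure based on counting oligomers, as singled out in this abstract, can be sketched as follows. The version here is a frame-independent log-odds score averaged over the k-mers in a window. That formulation, the pseudocount, and all function names are assumptions made for illustration and may differ from the measure benchmarked in the paper.

```python
import math
from collections import Counter

def kmer_counts(sequences, k):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def train_log_odds(coding_seqs, noncoding_seqs, k=6, pseudocount=1.0):
    """Return a function mapping a k-mer to log(P(kmer | coding) / P(kmer | noncoding))."""
    coding = kmer_counts(coding_seqs, k)
    noncoding = kmer_counts(noncoding_seqs, k)
    n_kmers = 4 ** k
    coding_total = sum(coding.values()) + pseudocount * n_kmers
    noncoding_total = sum(noncoding.values()) + pseudocount * n_kmers

    def log_odds(kmer):
        # Counter returns 0 for unseen k-mers, so the pseudocount keeps this finite.
        p_coding = (coding[kmer] + pseudocount) / coding_total
        p_noncoding = (noncoding[kmer] + pseudocount) / noncoding_total
        return math.log(p_coding / p_noncoding)

    return log_odds

def window_score(window, log_odds, k=6):
    """Average log-odds over all overlapping k-mers in the window."""
    n = len(window) - k + 1
    if n <= 0:
        raise ValueError("window is shorter than k")
    return sum(log_odds(window[i:i + k]) for i in range(n)) / n
```

A positive `window_score` means the window's oligomer content looks more like the coding training set than the noncoding one. The same `k` must be passed to `train_log_odds` and `window_score`.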


Subjects
Base Sequence; DNA/genetics; Genes; Genetic Techniques; Proteins/genetics; Algorithms; Base Composition; Codon/genetics; Exons; Fourier Analysis; Humans
19.
Genomics ; 13(4): 1056-64, 1992 Aug.
Article in English | MEDLINE | ID: mdl-1505943

ABSTRACT

We model the base compositional structure of the human and Escherichia coli genomes. Three particular properties are first quantified: (1) There is a significant tendency for any region of either genome to have a strand-symmetric base composition. (2) The variation in base composition from region to region, within each genome, is very much larger than expected from common homogeneous stochastic models. (3) A given local base composition tends to persist over a scale of at least kilobases (E. coli) or tens of kilobases (human). Multidomain stochastic models from the literature are reviewed and sharpened. In particular, quantitative measurements of the third property lead us to suggest a significant shift in the style of domain models, in which the variation of A+T content with position is modeled by a random walk with frequent small steps rather than with large quantum jumps. As an application, we suggest a way to reduce the amount of computation in the assembly of large sequences from sequences of randomly chosen fragments.
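
The random-walk style of domain model suggested in this abstract can be sketched as a small simulation. Local A+T content drifts by frequent small Gaussian steps from window to window, and each window's bases are drawn with A and T (and likewise C and G) equally likely, which also gives a strand-symmetric composition. The step size, bounds, window size, and function names are hypothetical values picked for illustration, not parameters from the paper.

```python
import random

def simulate_at_walk(n_windows, start=0.5, step_sd=0.01, low=0.2, high=0.8, seed=None):
    """Random walk of local A+T fraction with small steps, clamped to [low, high]."""
    rng = random.Random(seed)
    at_fraction = start
    profile = []
    for _ in range(n_windows):
        at_fraction = min(high, max(low, at_fraction + rng.gauss(0.0, step_sd)))
        profile.append(at_fraction)
    return profile

def generate_sequence(profile, window_size=100, seed=None):
    """Draw bases window by window according to the local A+T fraction."""
    rng = random.Random(seed)
    bases = []
    for at_fraction in profile:
        for _ in range(window_size):
            if rng.random() < at_fraction:
                bases.append(rng.choice("AT"))  # A and T equally likely
            else:
                bases.append(rng.choice("CG"))  # C and G equally likely
    return "".join(bases)
```

Because neighbouring windows differ only by a small step, a given local composition persists over many windows, in contrast with models where composition jumps abruptly between domains.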


Subjects
Escherichia coli/genetics; Genome, Bacterial; Genome, Human; Humans
20.
Biotechniques ; 10(6): 764-7, 1991 Jun.
Article in English | MEDLINE | ID: mdl-1878210

ABSTRACT

SCORE, a program for computer-assisted scoring of Southern blots of clone DNA, retains the use of expert human judgment while taking over much of the drudgery of the scoring task. The primary functions of the program are to help make an aligned overlay of the fluorescence gel image and the autoradiogram blot image, to keep track of band and lane locations and to store the resulting data directly into a database. Use of SCORE has resulted in greatly increased efficiency and accuracy.


Subjects
Blotting, Southern; Software; Autoradiography; Chromosome Mapping/methods; DNA Fingerprinting/methods; Electrophoresis, Agar Gel; Humans; Image Processing, Computer-Assisted/methods